{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(time-series)=\n",
    "# Time Series\n",
    "\n",
    "In this chapter, we'll look at time series. If you haven't yet looked at the two sections on **pandas**, the [Data Quickstart](data-quickstart) and [Working with Data](working-with-data) chapters, it might be worth taking a quick spin through them first. You may also find it useful to be familiar with a few of the concepts from the previous [chapter on time](time-intro).\n",
    "\n",
    "While we'll cover the basics here, the full set of time series functionality of **pandas** can be [found here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html).\n",
    "\n",
    "This chapter has benefitted from the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas, Tom Augspurger's [Effective Pandas](https://github.com/TomAugspurger/effective-pandas), and [Applied Time Series for Fisheries and Environmental Sciences](https://nwfsc-timeseries.github.io/atsa-labs/).\n",
    "\n",
    "Let's imports a few of the packages we'll need first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm\n",
    "from statsmodels.graphics import tsaplots\n",
    "from rich import inspect\n",
    "from numpy.random import Generator, PCG64\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Plot settings\n",
    "plt.style.use(\n",
    "    \"https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt\"\n",
    ")\n",
    "# Set max rows displayed for readability\n",
    "pd.set_option(\"display.max_rows\", 6)\n",
    "\n",
    "# Set seed for random numbers\n",
    "seed_for_prng = 78557\n",
    "prng = Generator(PCG64(seed_for_prng))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introducing Time Series with **pandas**\n",
    "\n",
    "[**pandas**](https://pandas.pydata.org/) is the workhorse of time series analysis in Python. The basic object is a *timestamp*. The `pd.to_datetime` function creates timestamps from strings that could reasonably represent datetimes. Let's see an example of using `pd.to_datetime` to create a timestamp and then inspect all of the methods and attributes of the created timestamp using **rich**'s `inspect` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "date = pd.to_datetime(\"16th of February, 2020\")\n",
    "inspect(date)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is of type `Timestamp` and you can see that it has many of the same properties as the built-in Python `datetime.datetime` class from the previous chapter. As with that, the default setting for `tz` (timezone) and `tzinfo` is `None`. There are some extra properties, though, such as `freq` for frequency, which will be very useful when it comes to manipulating time *series* as opposed to just one or two datetimes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating Time Series\n",
    "\n",
    "There are two main scenarios in which you might be creating time series using **pandas**: i) creating one from scratch or ii) reading in data from a file. Let's look at a few ways to do i) first. \n",
    "\n",
    "You can create a time series with **pandas** by taking a date as created above and extending it using **pandas** timedelta function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "date + pd.to_timedelta(np.arange(12), \"D\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This has created a datetime index of type `datetime65[ns]` (remember, an index is a special type of **pandas** column), where \"ns\" stands for nano-second resolution.\n",
    "\n",
    "Another method is to create a range of dates (pass a frequency using the `freq=` keyword argument):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.date_range(start=\"2018/1/1\", end=\"2018/1/8\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another way to create ranges is to specify the number of periods and the frequency:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.date_range(\"2018-01-01\", periods=3, freq=\"H\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following the discussion of the previous chapter on timezones, you can also localise timezones directly in **pandas** dataframes:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dti = pd.date_range(\"2018-01-01\", periods=3, freq=\"H\").tz_localize(\"UTC\")\n",
    "dti.tz_convert(\"US/Pacific\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's see how to turn data that has been read in with a non-datetime type into a vector of datetimes. This happens *all the time* in practice. We'll read in some data on job vacancies for information and communication jobs, ONS code UNEM-JP9P, and then try to wrangle the given \"date\" column into a **pandas** datetime column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "url = \"https://api.ons.gov.uk/timeseries/JP9P/dataset/UNEM/data\"\n",
    "\n",
    "# Get the data from the ONS API:\n",
    "df = pd.DataFrame(pd.json_normalize(requests.get(url).json()[\"months\"]))\n",
    "df[\"value\"] = pd.to_numeric(df[\"value\"])\n",
    "df = df[[\"date\", \"value\"]]\n",
    "df = df.rename(columns={\"value\": \"Vacancies (ICT), thousands\"})\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have the data in. Let's look at the column types that arrived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the default 'object' type, but we want the date column to have `datetime64[ns]`, which is a datetime type. Again, we use `pd.to_datetime`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"date\"] = pd.to_datetime(df[\"date\"])\n",
    "df[\"date\"].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, the conversion from the format of data that was put in of \"2001 MAY\" to datetime worked out-of-the-box. `pd.to_datetime` will always take an educated guess as to the format, but it won't always work out.\n",
    "\n",
    "What happens if we have a more tricky-to-read-in datetime column? This frequently occurs in practice so it's well worth exploring an example. Let's create some random data with dates in an unusual format with month first, then year, then day, eg \"1, '19, 29\" and so on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "small_df = pd.DataFrame({\"date\": [\"1, '19, 22\", \"1, '19, 23\"], \"values\": [\"1\", \"2\"]})\n",
    "small_df[\"date\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if we were to run this via `pd.to_datetime` with no further input, it would misinterpret, for example, the first date as `2022-01-19`. So we must pass a bit more info to `pd.to_datetime` to help it out. We can pass a `format=` keyword argument with the format that the datetime takes. Here, we'll use `%m` for month in number format, `%y` for year in 2-digit format, and `%d` for 2-digit day. We can also add in the other characters such as `'` and `,`. You can find a list of datetime format identifiers in the previous chapter or over at [https://strftime.org/](https://strftime.org/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.to_datetime(small_df[\"date\"], format=\"%m, '%y, %d\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Datetime Offsets\n",
    "\n",
    "Our data, currently held in `df`, were read in as if they were from the *start* of the month but these data refer to the month that has passed and so should be for the *end* of the month. Fortunately, we can change this using a time offset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"date\"] = df[\"date\"] + pd.offsets.MonthEnd()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While we used the `MonthEnd` offset here, there are many different offsets available. You can find a [full table of date offsets here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a datetime index and setting the frequency\n",
    "\n",
    "For the subsequent parts, we'll set the datetime column to be the index of the dataframe. *This is the standard setup you will likely want to use when dealing with time series.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.set_index(\"date\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, if we look at the first few entries of the index of dataframe (a datetime index) using `head` as above, we'll see that the `freq=` parameter is set as `None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.index[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This can be set for the whole dataframe using the `asfreq` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.asfreq(\"M\")\n",
    "df.index[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although most of the time it doesn't matter about the fact that `freq=None`, some aggregation operations need to know the frequency of the time series in order to work and it's good practice to set it if your data *are* regular. You can use `asfreq` to go from a higher frequency to a lower frequency too: the last entry from the higher frequency that aligns with the lower frequency will be taken, for example in going from months to years, December's value would be used.\n",
    "\n",
    "Note that trying to set the frequency when your datetime index doesn't match up to a particular frequency will cause errors or problems.\n",
    "\n",
    "A few useful frequencies to know about are in the table below; all of these can be used with `pd.to_datetime` too.\n",
    "\n",
    "| Code  | Represents                                                          |\n",
    "|-------|---------------------------------------------------------------------|\n",
    "| D     | Calendar day                                                        |\n",
    "| W     | Weekly                                                              |\n",
    "| M     | Month end                                                           |\n",
    "| Q     | Quarter end                                                         |\n",
    "| A     | Year end                                                            |\n",
    "| H     | Hours                                                               |\n",
    "| T     | Minutes                                                             |\n",
    "| S     | Seconds                                                             |\n",
    "| B     | Business day                                                        |\n",
    "| BM    | Business month end                                                  |\n",
    "| BQ    | Business quarter end                                                |\n",
    "| BA    | Business year end                                                   |\n",
    "| BH    | Business hours                                                      |\n",
    "| MS    | Month start                                                         |\n",
    "| QS    | Quarter start                                                       |\n",
    "| W-SUN | Weeks beginning with Sunday (similar for other days)                |\n",
    "| 2M    | Every 2 months (works with other combinations of numbers and codes) |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making Quick Time Series Plots\n",
    "\n",
    "Having managed to put your time series into a dataframe, perhaps converting a column of type string into a colume of type datetime in the process, you often just want to see the thing! We can achieve this using the `plot` command, as long as we have a datetime index.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resampling, Rolling, and Shifting\n",
    "\n",
    "Now our data have a *datetime index*, some common time series operations are made very easy for us.\n",
    "\n",
    "### Resampling\n",
    "\n",
    "Quite frequently, there is a situation in which one would like to change the frequency of a given time series. A time index-based dataframe makes this easy via the `resample` function. `resample` must be told *how* you'd like to resample the data, for example via the mean or median. Here's an example resampling the monthly data to annual and taking the mean:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.resample(\"A\").mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As resample is just a special type of aggregation, it can work with all of the usual functions that aggregations do, including in-built functions or user-defined functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.resample(\"5A\").agg([\"mean\", \"std\"]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Resampling can go up in frequency (up-sampling) as well as down, but we no longer need to choose an aggregation function, we must now choose how we'd like to fill in the gaps for the frequencies we didn't have in the original data. In the example below, they are just left as NaNs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.resample(\"D\").asfreq()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Options to fill in missing time series data include using `bfill` or `ffill` to fill in the blanks based on the next or last available value, respectively, or `interpolate` (note how only the first 3 NaNs are replaced using the `limit` keyword argument):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.resample(\"D\").interpolate(method=\"linear\", limit_direction=\"forward\", limit=3)[:6]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the differences between the filling methods more clearly in this stock market data, following a chart by Jake Vanderplas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get stock market data\n",
    "import pandas_datareader as web\n",
    "\n",
    "xf = web.DataReader(\"AAPL\", \"stooq\", start=\"2017-01-01\", end=\"2019-06-01\")\n",
    "xf = xf.sort_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "data = xf.iloc[:10, 3]\n",
    "cycle = ax._get_lines.prop_cycler\n",
    "\n",
    "data.asfreq(\"D\").plot(ax=ax, marker=\"o\", linestyle=\"None\", zorder=3)\n",
    "data.asfreq(\"D\", method=\"bfill\").plot(\n",
    "    ax=ax, style=\"-.o\", lw=1, color=next(cycle)[\"color\"]\n",
    ")\n",
    "data.asfreq(\"D\", method=\"ffill\").plot(\n",
    "    ax=ax, style=\"--o\", lw=1, color=next(cycle)[\"color\"]\n",
    ")\n",
    "ax.set_ylabel(\"Close ($)\")\n",
    "ax.legend([\"original\", \"back-fill\", \"forward-fill\"]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rolling Window Functions\n",
    "\n",
    "The `rolling` and `ewm` methods are both rolling window functions. The first includes functions of the sequence\n",
    "\n",
    "$$\n",
    "y_t = f(\\{x_{t-i} \\}_{i=0}^{i=R-1})\n",
    "$$\n",
    "\n",
    "where $R$ is the number of periods to use for the rolling window. For example, if the function is the mean, then $f$ takes the form $\\frac{1}{R}\\displaystyle\\sum_{i=0}^{i=R-1} x_{t-i}$.\n",
    "\n",
    "The example below is a 2-period rolling mean:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.rolling(2).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ewm` includes the class of functions where data point $x_{t-i}$ has a weight $w_i = (1-\\alpha)^i$. As $0 < \\alpha < 1$, points further back in time are given less weight. For example, an exponentially moving average is given by\n",
    "\n",
    "$$\n",
    "y_t = \\frac{x_t + (1 - \\alpha)x_{t-1} + (1 - \\alpha)^2 x_{t-2} + ...\n",
    "+ (1 - \\alpha)^t x_{0}}{1 + (1 - \\alpha) + (1 - \\alpha)^2 + ...\n",
    "+ (1 - \\alpha)^t}\n",
    "$$\n",
    "\n",
    "The example below shows the code for the exponentially weighted moving average:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.ewm(alpha=0.2).mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see these methods together on the stock market data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "roll_num = 28\n",
    "alpha = 0.03\n",
    "xf[\"Close\"].plot(label=\"Raw\", alpha=0.5)\n",
    "xf[\"Close\"].expanding().mean().plot(label=\"Expanding Average\", style=\":\")\n",
    "xf[\"Close\"].ewm(alpha=alpha).mean().plot(\n",
    "    label=f\"EWMA ($\\\\alpha=${alpha:.2f})\", style=\"--\"\n",
    ")\n",
    "xf[\"Close\"].rolling(roll_num).mean().plot(label=f\"{roll_num}D MA\", style=\"-.\")\n",
    "ax.legend()\n",
    "ax.set_ylabel(\"Close ($)\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more tools to analyse stocks, see the [**Pandas TA**](https://twopirllc.github.io/pandas-ta/) package.\n",
    "\n",
    "We can also use `rolling` as an intermediate step in creating more than one type of aggregation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "roll = xf[\"Close\"].rolling(50, center=True)\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "m = roll.agg([\"mean\", \"std\"])\n",
    "m[\"mean\"].plot(ax=ax)\n",
    "ax.fill_between(m.index, m[\"mean\"] - m[\"std\"], m[\"mean\"] + m[\"std\"], alpha=0.2)\n",
    "ax.set_ylabel(\"Close ($)\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shifting\n",
    "\n",
    "Shifting can move series around in time; it's what we need to create leads and lags of time series. Let's create a lead and a lag in the data. Remember that a lead is going to shift the pattern in the data to the left (ie earlier in time), while the lag is going to shift patterns later in time (ie to the right)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lead = 12\n",
    "lag = 3\n",
    "orig_series_name = df.columns[0]\n",
    "df[f\"lead ({lead} months)\"] = df[orig_series_name].shift(-lead)\n",
    "df[f\"lag ({lag} months)\"] = df[orig_series_name].shift(lag)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.iloc[100:300, :].plot();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Correlations\n",
    "\n",
    "In this section, we'll look at some different ways of comparing time series using simple metrics.\n",
    "\n",
    "### Autocorrelation\n",
    "\n",
    "First, what about comparing a time series to itself? The autocorrelation of a time series tells you how persistent it is. The econometrics package [**statsmodels**]() has some tools for this, most notably `statsmodels.tsa.stattools.acf`. Sometimes what you want is just a visual cue though, in which case the code below produces a nice chart."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = tsaplots.plot_acf(df[\"Vacancies (ICT), thousands\"], lags=24)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rememebr that the autocorrelation can tell you about likely terms to include in a moving average (MA) model.\n",
    "\n",
    "We can also look at the *partial autocorrelation*, which, at a given lag, is the correlation that results after removing the effect of any correlations due to the terms at shorter lags. It can be computer via the `statsmodels.tsa.stattools.pacf` function, but here's a chart:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = tsaplots.plot_pacf(df[\"Vacancies (ICT), thousands\"], lags=24)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that the partial autocorrelation function can tell you about likely terms to include in an autoregressive (AR) model.\n",
    "\n",
    "By looking at both the ACF and PACF, you can get an idea of what terms would be good to include in *both* an MA and AR model which, combined, make an ARMA model. It's also worth noting that, if the ACF and PACF do not seem to tail off, the series is likely to be non-stationary and differencing will be needed if you are going to model it with an ARMA approach."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Correlations with Other Series\n",
    "\n",
    "To compute correlations with other time series, a popular choice is the cross-correlation function (CCF). Confusingly, there are different definitions of this object in statistics and signal processing. In statistics, it's given by\n",
    "\n",
    "$$\n",
    "r_k = \\frac{g_k}{\\sqrt{\\sigma_x\\sigma_y}};\n",
    "$$\n",
    "\n",
    "where\n",
    "\n",
    "$$\n",
    "g_k = \\frac{1}{n}\\sum_{t=1}^{n-k} \\left(y_t-\\bar{y}\\right) \\left(x_{t+k}-\\bar{x}\\right),\n",
    "$$\n",
    "\n",
    "and $k$ is set by a rule of thumb, usually something like $10 \\log_{10}\\left(\\min(|x|, |y|)/2\\right)$. $g_k$ is the 'sliding' dot product of $y$, the response, with $x$, the predictor, up to the $k$ th lag.\n",
    "\n",
    "The default cross-correlation function in **statsmodels** is different. It only looks forward, and it computes *all* possible lags. You can call it via `sm.ccf` if you have already run `import statsmodels.api as sm`.\n",
    "\n",
    "We'll use the standard stats treatment here, but we'll need to define a function to do the computation we need. Then we'll apply the CCF to a classic example: the response of the number of Lynx trapped (in Canada) to sunspot activity.\n",
    "\n",
    "First, defining the function. We'll make use of the signal processing `ss.correlate` function but modify it so that adopts the definition of CCF used in statistics:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import scipy.signal as ss\n",
    "\n",
    "\n",
    "def ccf_stats(x, y, lag_max=None):\n",
    "    if not lag_max:\n",
    "        lag_max = int(10 * np.log10(min(len(x), len(y)) / 2))\n",
    "    result = ss.correlate(x - np.mean(x), y - np.mean(y), method=\"direct\") / (\n",
    "        np.std(y) * np.std(x) * len(y)\n",
    "    )\n",
    "    length = (len(result) - 1) // 2\n",
    "    lo = length - lag_max\n",
    "    hi = length + (lag_max + 1)\n",
    "    return result[lo:hi]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's grab the two different datasets and merge them, taking a quick look while we're at it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lynx_df = pd.read_csv(\n",
    "    \"https://vincentarelbundock.github.io/Rdatasets/csv/datasets/lynx.csv\", index_col=0\n",
    ")\n",
    "spot_df = pd.read_csv(\n",
    "    \"https://vincentarelbundock.github.io/Rdatasets/csv/datasets/sunspot.year.csv\",\n",
    "    index_col=0,\n",
    ")\n",
    "df = pd.merge(lynx_df, spot_df, on=\"time\", how=\"inner\", suffixes=(\"_lynx\", \"_sunspots\"))\n",
    "df = df.set_index(\"time\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(subplots=True, ylim=(0, None));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use our defined function on the data. We'll add in a rough confidence interval of $\\pm \\frac{2}{\\sqrt{n}}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots()\n",
    "ccf_res_stats = ccf_stats(df[\"value_sunspots\"], np.log(df[\"value_lynx\"]))\n",
    "conf_int = 2.0 / np.sqrt(len(df))\n",
    "ax.stem(\n",
    "    range(-int(len(ccf_res_stats) / 2), int(len(ccf_res_stats) / 2) + 1),\n",
    "    ccf_res_stats,\n",
    "    basefmt=\" \",\n",
    "    markerfmt=\" \",\n",
    ")\n",
    "ax.axhline(-conf_int, ls=\"--\", lw=0.9, color=\"k\")\n",
    "ax.axhline(conf_int, ls=\"--\", lw=0.9, color=\"k\")\n",
    "ax.axhline(0, color=\"k\", lw=1)\n",
    "ax.set_ylabel(\"Cross-correlation\")\n",
    "ax.set_xlabel(\"Lag\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The significant correlations at lags of -3 to -5 seem to suggest that lynx numbers fall in the 3 to 5 years after high sunspot activity."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transformations\n",
    "\n",
    "There are many reasons why you might want to perform transformations on series, including to normalise data and to adjust for changes (such as inflation). In this section, we'll look at some of the most common transforms for time series.\n",
    "\n",
    "### Mathematical Transformations\n",
    "\n",
    "The usual transformations that we know and love from other series of numerical data can equally be applied to time series, for example the log and power transforms. Let's see some of these in action using data on US GDP."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas_datareader.data as web\n",
    "from datetime import datetime\n",
    "\n",
    "ts_start_date = pd.to_datetime(\"2009-01-01\")\n",
    "\n",
    "df = web.DataReader(\"NA000334Q\", \"fred\", start=ts_start_date, end=datetime.now())\n",
    "df.columns = [\"gdp\"]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create new columns for each of log, square, and square root transforms. All of these mathematical transformations can be found in **numpy**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"log\"] = np.log(df[\"gdp\"])\n",
    "df[\"square\"] = np.power(df[\"gdp\"], 2)\n",
    "df[\"sqrt\"] = np.sqrt(df[\"gdp\"])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A very common transformation is to change to growth space. In time series maths, this is $\\frac{y_t}{y_{t-1}}$ for period-on-period growth or, say, $\\frac{y_t}{y_{t-4}}$ for growth relative to 4 periods ago.\n",
    "\n",
    "Let's add in quarter-on-quarter-a-year-ago and quarter-on-quarter growth rates to our growing list of transforms:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"q_on_q\"] = df[\"gdp\"].pct_change(1) * 100\n",
    "df[\"q_on_q_yr_ago\"] = df[\"gdp\"].pct_change(4) * 100\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another classic transform you might need is to de-mean your data. We are transforming such that\n",
    "\n",
    "$$\n",
    "y_t' = y_t - \\langle y_t \\rangle\n",
    "$$\n",
    "\n",
    "where the angular brackets denote a mean. Here's an example of that:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"demean\"] = df[\"gdp\"] - df[\"gdp\"].mean()\n",
    "df[[\"gdp\", \"demean\"]].tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Be wary though! The mean is a function that, by default, uses all time periods, so you need to be careful if you're working on a real-time data problem such as forecasting. In this case you only want to use the mean *up to that point in time* to demean a series. Mathematically, this is\n",
    "\n",
    "$$\n",
    "y_t' = y_t - \\frac{1}{t+1}\\displaystyle{\\sum_{\\tau=0}^{\\tau=t} y_\\tau}\n",
    "$$\n",
    "\n",
    "To do this, we can make use of the **pandas** *expanding* functionality we saw earlier. Here's an example of real-time demeaning:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"real_time_demean\"] = df[\"gdp\"] - df[\"gdp\"].expanding().mean()\n",
    "df[[\"gdp\", \"real_time_demean\"]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how the second real-time demeaned entry is the second entry for GDP minus the average of the second and first entries of GDP. A more general way of avoiding data leakage is to use **scikit-learn**'s [pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) functionality.\n",
    "\n",
    "Let's move on to another classic transformation: differences. These are given by\n",
    "\n",
    "$$\n",
    "\\Delta y_t = y_t - y_{t-1}\n",
    "$$\n",
    "\n",
    "Note that, in this case, the first entry in the differences won't be defined (ie appears as `NaN`) because there is no $\\text{gdp}_{t=-1}$. Use `diff(n)` to get $y_t - y_{t-n}$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"difference\"] = df[\"gdp\"].diff()\n",
    "df[[\"gdp\", \"difference\"]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The Box-Cox Transform\n",
    "\n",
    "This transform actually nests power transforms and logarithms, depending on the value of a parameter that is usually denoted $\\lambda$. The transform is given by\n",
    "\n",
    "$$\n",
    "y_t'  =\n",
    "    \\begin{cases}\n",
    "      \\ln(y_t) & \\text{if $\\lambda=0$};  \\\\\n",
    "      (y_t^\\lambda-1)/\\lambda & \\text{otherwise}.\n",
    "    \\end{cases}\n",
    "$$\n",
    "\n",
    "How do you choose $\\lambda$? One way is to choose such that the transformed data resemble a normal distribution. Let's see a couple of examples, including an optimal value for $\\lambda$ as determined by **scipy**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy import stats\n",
    "\n",
    "data = prng.exponential(size=1000)\n",
    "# Last entry of the list gets the optimal value for lambda\n",
    "λ_values = [0, 0.6, stats.boxcox(data)[1]]\n",
    "# One extra plot for the original data\n",
    "num_plots = len(λ_values) + 1\n",
    "# For fitting a normal distribution\n",
    "x = np.linspace(-5, 5, 100)\n",
    "bins = 30\n",
    "\n",
    "\n",
    "fig, axes = plt.subplots(ncols=num_plots // 2, nrows=num_plots - num_plots // 2)\n",
    "axes = axes.flatten()\n",
    "for i, ax in enumerate(axes[1:]):\n",
    "    values = stats.boxcox(data, lmbda=λ_values[i])\n",
    "    ax.hist(values, density=True, bins=bins)\n",
    "    title = r\"$\\lambda =$\" + f\"{λ_values[i]}\"\n",
    "    if i == len(λ_values) - 1:\n",
    "        title = r\"$\\lambda^*=$\" + f\"{λ_values[i]:.2f}\"\n",
    "    ax.set_title(title, size=12)\n",
    "    mu, std = stats.norm.fit(values)\n",
    "    p = stats.norm.pdf(x, mu, std)\n",
    "    ax.plot(x, p, linewidth=2, alpha=0.9)\n",
    "axes[0].hist(data, density=True, bins=bins)\n",
    "axes[0].set_title(\"Original\", size=12)\n",
    "[ax.set_xlim(-4, 4) for ax in axes]\n",
    "plt.suptitle(\"Box Cox Transformed Distributions\", size=18)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can reverse the Box Cox transform using **scipy**'s `scipy.special.inv_boxcox(y, lmbda)` function. This is useful if you have performed an operation on the transformed data, for example a forecast, and want to transform the outputs back into the original space."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adjusting for Inflation\n",
    "\n",
    "This one's mighty important! As every economist knows, the value of money changes over time: a dollar in 1963 does not buy the same bundle of goods and services as a dollar in 2021 would. The price level has changed considerably between those two times. Let's say we have prices in year $t$ given by $y_t$. Then the price of that same good or service in year $\\omega$ is\n",
    "\n",
    "$$\n",
    "y_\\omega = y_t \\frac{\\pi_{\\omega}} {\\pi_t}\n",
    "$$\n",
    "\n",
    "for example, to find the price of a television in 2021 dollars using data on televisions from 2000 we would do $y_{2021} = y_{2000} \\cdot \\frac{\\pi_{2021}}{\\pi_{2000}}$. Note that $\\pi$ here is being used as a symbol for inflation, *not* the ratio of a circle's circumference to its diameter in Euclidean space!\n",
    "\n",
    "Let's see a real example using data on an important good, cheese. We'll grab the average price per pound of cheese in US cities and adjust it to be in 2012 dollars using the PCE (personal consumption expenditure) price index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fred_codes = {\n",
    "    \"PCEPI\": \"inflation\",\n",
    "    \"APU0000710212\": \"cheddar\",  # Avg price per pound in US cities\n",
    "}\n",
    "\n",
    "\n",
    "cdf = (\n",
    "    pd.concat(\n",
    "        [\n",
    "            web.DataReader(\n",
    "                x,\n",
    "                \"fred\",\n",
    "                start=pd.to_datetime(\"2000-01-01\"),\n",
    "                end=pd.to_datetime(\"2021-01-01\"),\n",
    "            )\n",
    "            for x in fred_codes.keys()\n",
    "        ],\n",
    "        axis=1,\n",
    "    )\n",
    "    .rename(columns=fred_codes)\n",
    "    .groupby(pd.Grouper(freq=\"A\"))\n",
    "    .mean()\n",
    ")\n",
    "cdf.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cdf[\"cheddar_i_adjust\"] = (\n",
    "    cdf.loc[\"2021-12-31\", \"inflation\"] * cdf[\"cheddar\"] / cdf[\"inflation\"]\n",
    ")\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(cdf.index, cdf[\"cheddar\"], label=\"Cheddar (nominal USD)\")\n",
    "ax.plot(cdf.index, cdf[\"cheddar_i_adjust\"], label=\"Cheddar (2021 USD)\", linestyle=\"--\")\n",
    "ax.set_ylabel(\"Cost per pound (USD)\")\n",
    "ax.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see from the chart that, once adjusted for inflation, cheese was not such a great bargain in the noughties as it appeared from the nominal cost per pound. But the big fall in cheese prices between 2013 and 2017 *was* a real effect and coincided with the United States' [great cheese surplus](https://www.vox.com/2016/10/13/13268980/cheese-glut-united-states). Supply and demand work! Of course, the costs are the same in 2021 dollars."
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "671f4d32165728098ed6607f79d86bfe6b725b450a30021a55936f1af379a247"
  },
  "kernelspec": {
   "display_name": "Python 3.8.6 64-bit ('codeforecon': conda)",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
